💽 Exercise 10A

Build an sqlite database from your month of data.

  1. Read the data into R. It’s possible to do this, because this is a small amount of data.
  2. Create an sqlite data base called “flights” from the data
  3. Do a simple query, collecting all the records for the first day of the month
# This code only needs to run it once
library(tidyverse)
library(DBI)
d <- read_csv(here::here("data", "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_10.csv"))
# install.packages("RSQLite")
con <- dbConnect(RSQLite::SQLite(), "data/:flights:")
copy_to(con, d, "flights", 
  temporary = FALSE, 
  indexes = list(
    c("FlightDate", 
    "Reporting_Airline", 
    "Tail_Number",
    "Origin",
    "Dest"
  )
))
flights_db <- tbl(con, "flights")
feb1 <- flights_db  %>%
  filter(DayofMonth==1) %>% 
  select(DayofMonth, Origin, Dest, DepTime, ArrTime,
         IATA_CODE_Reporting_Airline, Tail_Number) %>%
  collect()
feb1

🧮 Exercise 10B

Explore your data! These tasks should be done using the dplyr interface, so that the tidy wrangling verbs can be used instead of raw SQL functions.

  1. Find the carrier that had the most flights during the month. Determine which carrier this is?
## # A tibble: 5 x 2
##   IATA_CODE_Reporting_Airline      n
##   <chr>                        <int>
## 1 WN                          116465
## 2 DL                           86322
## 3 AA                           81028
## 4 OO                           72350
## 5 UA                           55074

WN = Southwest had the most flights in February 2020.

  1. Which airport had the most departing traffic?
## # A tibble: 5 x 2
##   Origin     n
##   <chr>  <int>
## 1 ATL    33643
## 2 ORD    30125
## 3 DFW    26396
## 4 DEN    22752
## 5 CLT    20569

ATL = Atlanta had the most departing flights.

  1. Compute the smallest, largest and median departure delay for the busiest airport. What would it mean if the median departure delay was negative?

ATL, Atlanta, is the busiest airport. The smallest departure delay for the month was -22 minutes, which means the flight left early, quite early. The longest delay was 896 minutes, almost a day delay. The median delay was -2, which is less than 0. That means that 50% of the flights left before scheduled.

  1. Make a side-by-side boxplot of the delays for each carrier, at the busiest airport.
    1. Think about transforming delay because it has a skewed distribution. (If you use a transformation on the axis, check the number of missings. It may be that a lot of data is excluded and you need to do the transformation with mutate.)
    2. Sort the carrier axis by the median delay (this is tricky! Hint: use the forcats package).
    3. Make nice labels on the axis
    4. Write a paragraph on what is learned about the delays by carrier

Overall, there is not much difference in the median delays, and the variation in delays between carriers. Southwest has the highest median delay and Republic Airlines has the lowest median delay. Delta, which has this as the hub, has a small interquartile range in delays but a large number of early flights as well as many delayed flights.

  1. How many records, of the busiest airport, have missing values for departure delay?
## # A tibble: 2 x 2
##   `is.na(DepDelay)`     n
##   <lgl>             <int>
## 1 FALSE             33586
## 2 TRUE                 57
  1. Is there a ghost flight in your month of data? What code would you use to find this? Yep, there are a lot of ghost flights!
df <- tbl(con, "flights") %>%
  select(Tail_Number, Origin, Dest, FlightDate, DepTime) %>%
  filter(!is.na(Tail_Number)) %>%
  filter(!is.na(DepTime)) %>%
  mutate(DepTime = as.numeric(DepTime)) %>%
  arrange(Tail_Number, FlightDate, DepTime) %>%
  collect()
df %>% count(Tail_Number, sort=TRUE)
aircraft <- df %>%
  select(Tail_Number) %>%
  distinct()
for (i in 1:nrow(aircraft)) {
  d <- df %>% filter(Tail_Number == aircraft$Tail_Number[i]) 
  if (nrow(d) > 1) {
    for (j in 2:nrow(d)) {
      if (d$Origin[j] != d$Dest[j-1]) {
        cat(d$Tail_Number[j], d$Dest[j-1], d$Origin[j],  "\n")
      }
    }
  }
  cat(i, "\n")
}

📍 Exercise 10C

Here we are going to add a new table with airport information, and use this to make a map of flights.

  1. Read the airport location data into R, and add a table to your database.
airports <- read_csv(here::here("data","402312038_T_MASTER_CORD.csv")) %>%
  select(-X29)
copy_to(con, airports, "airports", 
  temporary = FALSE
)
  1. Plot the locations on a map. You should filter the airports to only the latest location. Airports sometimes move 🤭. An Open Street Map can be downloaded using the get_map() function in the ggmap package.
## [1] "airports"     "flights"      "sqlite_stat1" "sqlite_stat4"

  1. Now the fun part, lets take a day’s worth of flights, and plot all the flights. You will need to join the day of flights data with the airport locations, using both the origin and destination.

  1. Choose the two major carriers for your day of data, and make two separate maps of flights, one for each carrier. Compare and contrast the carrier flight patterns.

I chose Delta and Southwest. It looks a little like Delta has more of a hub system, and Southwest is more distributed, serving many more airports.

  1. ADVANCED: Now we are going to examine change in patterns over the course of a day. You will need to convert departure time into a standard time. Then break it into one of four categories: midnight-6am, 6am-noon, noon-6pm, 6pm-midnight. Using all the carriers again, make separate maps for each quarter of the day. Compare the traffic over these four time blocks.

There’s not a lot to see in four big groups like this. Its an exercise in working with time. And also in ordering the four groups appropriately.

  1. ADVANCED: Use the standardised times to follow the path of one plane during the day.